We thank the reviewers for the detailed comments, suggestions, and the positive assessment of our work. We will correct the color schemes in all figures (R1). We have also made the figure captions cleaner (R3), and we have added a description of the setup to the paper. In Fig. 5 (left), DisCor actually outperforms uniform sampling over state-action tuples, Unif(s, a), on these environments. In the revised version of our paper, we will clarify the details in Section 3 (R2) and make the intuition in the methods section much clearer.
LDM$^2$: A Large Decision Model Imitating Human Cognition with Dynamic Memory Enhancement
Wang, Xingjin, Li, Linjing, Zeng, Daniel
With the rapid development of large language models (LLMs), there is a strong demand for adopting LLMs as decision makers on the path toward artificial general intelligence. Most approaches leverage manually crafted examples to prompt the LLMs to imitate the human decision process. However, designing optimal prompts is difficult, and such patterned prompts can hardly generalize to more complex environments. In this paper, we propose a novel model named Large Decision Model with Memory (LDM$^2$), which leverages a dynamic memory mechanism to construct dynamic prompts, guiding the LLMs to make proper decisions according to the current state. LDM$^2$ consists of two stages: memory formation and memory refinement. In the former stage, human behaviors are decomposed into state-action tuples using the powerful summarizing ability of LLMs. These tuples are then stored in the memory, with indices generated by the LLMs, to facilitate retrieval of the most relevant subset of memorized tuples given the current state. In the latter stage, LDM$^2$ employs tree exploration to discover more suitable decision processes and enriches the memory by adding valuable state-action tuples. This dynamic cycle of exploration and memory enhancement gives LDM$^2$ a better understanding of the global environment. Extensive experiments conducted in two interactive environments show that LDM$^2$ outperforms the baselines in terms of both score and success rate, demonstrating its effectiveness.
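The abstract does not include code; the following is a minimal sketch of how a state-action tuple memory with state-based retrieval might look. The class and function names (Memory, add, retrieve, embed), the bag-of-words "index", and the cosine-similarity retrieval are illustrative assumptions, not the authors' implementation, in which the indices are generated by the LLM itself.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "index"; LDM^2 instead lets the LLM generate indices.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Memory:
    """Stores (state, action) tuples and retrieves those most relevant to the current state."""
    def __init__(self):
        self.tuples = []  # list of (index_embedding, state, action)

    def add(self, state: str, action: str) -> None:
        self.tuples.append((embed(state), state, action))

    def retrieve(self, current_state: str, k: int = 3):
        q = embed(current_state)
        ranked = sorted(self.tuples, key=lambda t: cosine(q, t[0]), reverse=True)
        return [(s, a) for _, s, a in ranked[:k]]

# Memory formation: decompose demonstrated behavior into state-action tuples.
mem = Memory()
mem.add("you are in the kitchen, the drawer is closed", "open drawer")
mem.add("you are holding a knife near the counter", "put knife on counter")

# At decision time, the retrieved tuples would be inserted into the prompt.
print(mem.retrieve("you are in the kitchen and see a closed drawer", k=1))
```

In the refinement stage, tuples discovered during tree exploration would be appended to the same store via `add`, so later retrievals draw on the enriched memory.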
CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning
Yue, Sheng, Wang, Guanbo, Shao, Wei, Zhang, Zhaofeng, Lin, Sen, Ren, Ju, Zhang, Junshan
This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL), namely the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm (namely CLARE) that solves offline IRL efficiently by integrating "conservatism" into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining the subtle two-tier tradeoffs between "exploitation" (on both expert and diverse data) and "exploration" (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right "exploitation-exploration" balance therein. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and the learned reward is highly instructive for further learning (source code).

The primary objective of Inverse Reinforcement Learning (IRL) is to learn a reward function from demonstrations (Arora & Doshi, 2021; Russell, 1998). In general, conventional IRL methods rely on extensive online trial and error, which can be costly, or require a fully known transition model (Abbeel & Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Syed & Schapire, 2007; Boularias et al., 2011; Osa et al., 2018), and thus struggle to scale in many real-world applications. To tackle this problem, this paper studies offline IRL, with a focus on learning from a previously collected dataset without online interaction with the environment.
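CLARE's exact objective is not reproduced here; the following is a schematic NumPy sketch of the general idea of "conservative" reward learning described above: the reward is pushed up on expert data (and, with a smaller weight, diverse data) and pushed down on state-actions sampled from the estimated dynamics model, so that it does not extrapolate optimistically outside the dataset. The feature map, penalty form, function names, and weights (`beta`, `lam`) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(states, actions):
    # Illustrative feature map phi(s, a); a real implementation would use a network.
    return np.concatenate([states, actions], axis=1)

def conservative_reward_step(w, expert, diverse, model_samples, lr=0.1, beta=1.0, lam=0.5):
    """One gradient ascent step on a schematic conservative objective:
       maximize  E_expert[r] + lam * E_diverse[r] - beta * E_model[r]  - L2 penalty,
       where r(s, a) = w . phi(s, a). Lowering the reward on model-generated
       samples discourages optimistic extrapolation beyond the data."""
    grad = (features(*expert).mean(axis=0)
            + lam * features(*diverse).mean(axis=0)
            - beta * features(*model_samples).mean(axis=0)
            - 0.01 * w)  # small L2 regularizer keeps the weights bounded
    return w + lr * grad

# Toy data: 2-D states, 1-D actions.
expert = (rng.normal(size=(32, 2)), rng.normal(size=(32, 1)))
diverse = (rng.normal(size=(32, 2)), rng.normal(size=(32, 1)))
model_samples = (rng.normal(size=(32, 2)), rng.normal(size=(32, 1)))

w = np.zeros(3)
for _ in range(100):
    w = conservative_reward_step(w, expert, diverse, model_samples)
print("learned reward weights:", w)
```

The relative sizes of `beta` and `lam` play the role of the "exploitation-exploration" balance discussed in the abstract: a larger model-sample penalty makes the reward more conservative, while larger data weights exploit the offline datasets more aggressively.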